Premiums paid by customers are the major revenue source for insurance companies. Defaults on premium payments result in significant revenue losses, so insurers want to know upfront which customers are likely to default. The objective of this project is to predict the probability that a customer will default on a premium payment, so that an insurance agent can proactively reach out to the policyholder and follow up on the payment.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn import metrics
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier
)
from xgboost import XGBClassifier
data=pd.read_csv(r'C:\Users\davis\OneDrive\Desktop\Python\Capstone_Project\Insurance_Premium.csv')
data.head()
| id | perc_premium_paid_by_cash | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.32 | 11330 | 90050 | 0 | 0 | 0 | 0 | 3 | 3 | 1 | 98.81 | 8 | A | Rural | 5400 | 0 |
| 1 | 2 | 0.00 | 30309 | 156080 | 0 | 0 | 0 | 1 | 3 | 1 | 1 | 99.07 | 3 | A | Urban | 11700 | 0 |
| 2 | 3 | 0.02 | 16069 | 145020 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 99.17 | 14 | C | Urban | 18000 | 0 |
| 3 | 4 | 0.00 | 23733 | 187560 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 99.37 | 13 | A | Urban | 13800 | 0 |
| 4 | 5 | 0.89 | 19360 | 103050 | 7 | 3 | 4 | 0 | 2 | 1 | 0 | 98.80 | 15 | A | Urban | 7500 | 1 |
data.tail()
| id | perc_premium_paid_by_cash | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 79848 | 79849 | 0.25 | 25555 | 64420 | 0 | 0 | 0 | 1 | 2 | 4 | 0 | 99.08 | 10 | A | Urban | 5700 | 0 |
| 79849 | 79850 | 0.00 | 16797 | 660040 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 99.65 | 9 | B | Urban | 28500 | 0 |
| 79850 | 79851 | 0.01 | 24835 | 227760 | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 99.66 | 11 | A | Rural | 11700 | 0 |
| 79851 | 79852 | 0.19 | 10959 | 153060 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 99.46 | 24 | A | Urban | 11700 | 0 |
| 79852 | 79853 | 0.00 | 19720 | 324030 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 99.80 | 7 | D | Rural | 3300 | 0 |
data.shape
(79853, 17)
The dataset has 79,853 rows and 17 columns.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 79853 entries, 0 to 79852 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 79853 non-null int64 1 perc_premium_paid_by_cash 79853 non-null float64 2 age_in_days 79853 non-null int64 3 Income 79853 non-null int64 4 Count_3-6_months_late 79853 non-null int64 5 Count_6-12_months_late 79853 non-null int64 6 Count_more_than_12_months_late 79853 non-null int64 7 Marital Status 79853 non-null int64 8 Veh_Owned 79853 non-null int64 9 No_of_dep 79853 non-null int64 10 Accomodation 79853 non-null int64 11 risk_score 79853 non-null float64 12 no_of_premiums_paid 79853 non-null int64 13 sourcing_channel 79853 non-null object 14 residence_area_type 79853 non-null object 15 premium 79853 non-null int64 16 default 79853 non-null int64 dtypes: float64(2), int64(13), object(2) memory usage: 10.4+ MB
perc_premium_paid_by_cash and risk_score are floats (percentages with decimal points). sourcing_channel and residence_area_type are objects (non-numerical values). All other columns are integers (numerical values).
pd.DataFrame(
data={
"% of Missing Values": round(data.isna().sum() / data.isna().count() * 100, 2)
}
)
| % of Missing Values | |
|---|---|
| id | 0.0 |
| perc_premium_paid_by_cash | 0.0 |
| age_in_days | 0.0 |
| Income | 0.0 |
| Count_3-6_months_late | 0.0 |
| Count_6-12_months_late | 0.0 |
| Count_more_than_12_months_late | 0.0 |
| Marital Status | 0.0 |
| Veh_Owned | 0.0 |
| No_of_dep | 0.0 |
| Accomodation | 0.0 |
| risk_score | 0.0 |
| no_of_premiums_paid | 0.0 |
| sourcing_channel | 0.0 |
| residence_area_type | 0.0 |
| premium | 0.0 |
| default | 0.0 |
data.isna().sum()
id 0 perc_premium_paid_by_cash 0 age_in_days 0 Income 0 Count_3-6_months_late 0 Count_6-12_months_late 0 Count_more_than_12_months_late 0 Marital Status 0 Veh_Owned 0 No_of_dep 0 Accomodation 0 risk_score 0 no_of_premiums_paid 0 sourcing_channel 0 residence_area_type 0 premium 0 default 0 dtype: int64
No missing data.
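The per-column missing-value percentage computed above can also be written more compactly with `isna().mean()`; a small sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# isna().mean() gives the fraction of missing values per column directly
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1, 2, 3]})
pct_missing = (df.isna().mean() * 100).round(2)
print(pct_missing.to_dict())  # {'a': 33.33, 'b': 0.0}
```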
data.nunique()
id 79853 perc_premium_paid_by_cash 101 age_in_days 833 Income 24165 Count_3-6_months_late 14 Count_6-12_months_late 17 Count_more_than_12_months_late 10 Marital Status 2 Veh_Owned 3 No_of_dep 4 Accomodation 2 risk_score 672 no_of_premiums_paid 57 sourcing_channel 5 residence_area_type 2 premium 30 default 2 dtype: int64
data.drop(['id'],axis=1,inplace=True)
Dropping id: it is a unique identifier with no predictive value.
cat_col = [
"residence_area_type",
"sourcing_channel",
]
# Printing number of count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print("-" * 40)
Urban 48183 Rural 31670 Name: residence_area_type, dtype: int64 ---------------------------------------- A 43134 B 16512 C 12039 D 7559 E 609 Name: sourcing_channel, dtype: int64 ----------------------------------------
Two categorical columns: residence_area_type has two unique values, and sourcing_channel has five.
cols = data.select_dtypes(['object'])
cols.columns
Index(['sourcing_channel', 'residence_area_type'], dtype='object')
for i in cols.columns:
data[i] = data[i].astype('category')
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 79853 entries, 0 to 79852 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 perc_premium_paid_by_cash 79853 non-null float64 1 age_in_days 79853 non-null int64 2 Income 79853 non-null int64 3 Count_3-6_months_late 79853 non-null int64 4 Count_6-12_months_late 79853 non-null int64 5 Count_more_than_12_months_late 79853 non-null int64 6 Marital Status 79853 non-null int64 7 Veh_Owned 79853 non-null int64 8 No_of_dep 79853 non-null int64 9 Accomodation 79853 non-null int64 10 risk_score 79853 non-null float64 11 no_of_premiums_paid 79853 non-null int64 12 sourcing_channel 79853 non-null category 13 residence_area_type 79853 non-null category 14 premium 79853 non-null int64 15 default 79853 non-null int64 dtypes: category(2), float64(2), int64(12) memory usage: 8.7 MB
data['age_in_days']=data['age_in_days']/365
data=data.rename(columns = {'age_in_days':'age'})
Change age in days to years. Change column name to age.
data["age"] = data["age"].astype("int64")
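The conversion above truncates partial years, because `astype("int64")` drops the fractional part after dividing by 365; a quick check on hypothetical day counts:

```python
import pandas as pd

# Hypothetical ages in days; dividing by 365 then casting to int64 truncates
ages_days = pd.Series([11330, 30309, 16069])
ages_years = (ages_days / 365).astype("int64")
print(ages_years.tolist())  # [31, 83, 44]
```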
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| perc_premium_paid_by_cash | 79853.0 | 0.314667 | 0.334935 | 0.0 | 0.03 | 0.17 | 0.54 | 1.00 |
| age | 79853.0 | 51.607404 | 14.270484 | 21.0 | 41.00 | 51.00 | 62.00 | 103.00 |
| Income | 79853.0 | 208847.171177 | 496582.597257 | 24030.0 | 108010.00 | 166560.00 | 252090.00 | 90262600.00 |
| Count_3-6_months_late | 79853.0 | 0.248369 | 0.691102 | 0.0 | 0.00 | 0.00 | 0.00 | 13.00 |
| Count_6-12_months_late | 79853.0 | 0.078093 | 0.436251 | 0.0 | 0.00 | 0.00 | 0.00 | 17.00 |
| Count_more_than_12_months_late | 79853.0 | 0.059935 | 0.311840 | 0.0 | 0.00 | 0.00 | 0.00 | 11.00 |
| Marital Status | 79853.0 | 0.498679 | 0.500001 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 |
| Veh_Owned | 79853.0 | 1.998009 | 0.817248 | 1.0 | 1.00 | 2.00 | 3.00 | 3.00 |
| No_of_dep | 79853.0 | 2.503012 | 1.115901 | 1.0 | 2.00 | 3.00 | 3.00 | 4.00 |
| Accomodation | 79853.0 | 0.501296 | 0.500001 | 0.0 | 0.00 | 1.00 | 1.00 | 1.00 |
| risk_score | 79853.0 | 99.067392 | 0.725892 | 91.9 | 98.83 | 99.18 | 99.52 | 99.89 |
| no_of_premiums_paid | 79853.0 | 10.863887 | 5.170687 | 2.0 | 7.00 | 10.00 | 14.00 | 60.00 |
| premium | 79853.0 | 10924.507533 | 9401.676542 | 1200.0 | 5400.00 | 7500.00 | 13800.00 | 60000.00 |
| default | 79853.0 | 0.062590 | 0.242226 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 |
perc_premium_paid_by_cash averages 31% (range 0%–100%). age averages 51 years (range 21–103). Income averages 208,847 (range 24,030–90,262,600). Count_3-6_months_late averages 0.25 (range 0–13). Count_6-12_months_late averages 0.08 (range 0–17). Count_more_than_12_months_late averages 0.06 (range 0–11). Veh_Owned averages 2 (range 1–3). risk_score averages 99.07 (range 91.9–99.89). no_of_premiums_paid averages 10.86 (range 2–60). premium averages 10,924 (range 1,200–60,000).
UNIVARIATE ANALYSIS
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined.
    feature: 1-d feature array
    figsize: size of figure (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # two rows: boxplot on top, histogram below
        sharex=True,  # x-axis shared among the subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker indicates the mean value of the column
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins, color="orange")
    else:
        sns.distplot(feature, kde=False, ax=ax_hist2, color="tab:cyan")  # histogram
    ax_hist2.axvline(
        np.mean(feature), color="purple", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        np.median(feature), color="black", linestyle="-"
    )  # add median to the histogram
histogram_boxplot(data["perc_premium_paid_by_cash"])
Data is right skewed. No outliers. 50% of customers pay less than 17% cash.
histogram_boxplot(data["age"])
Boxplot indicates outliers at older ages. Approximately normally distributed with a mean of 51.
data[(data.age>95)]
| perc_premium_paid_by_cash | age | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1158 | 0.19 | 96 | 45050 | 0 | 0 | 0 | 0 | 2 | 3 | 1 | 99.83 | 3 | A | Rural | 3300 | 0 |
| 1671 | 0.04 | 96 | 43300 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 99.89 | 5 | A | Urban | 1200 | 0 |
| 1922 | 1.00 | 103 | 48130 | 0 | 0 | 0 | 1 | 3 | 1 | 1 | 99.07 | 5 | A | Rural | 1200 | 0 |
| 3974 | 0.13 | 96 | 240150 | 0 | 0 | 0 | 1 | 3 | 4 | 0 | 99.81 | 11 | A | Rural | 1200 | 0 |
| 8480 | 0.01 | 97 | 98700 | 0 | 1 | 0 | 1 | 2 | 2 | 0 | 99.07 | 6 | A | Rural | 7500 | 0 |
| 9387 | 0.01 | 98 | 28550 | 1 | 0 | 0 | 0 | 2 | 2 | 1 | 98.44 | 10 | A | Urban | 5700 | 0 |
| 16269 | 0.00 | 97 | 45130 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | 99.07 | 17 | A | Urban | 5700 | 0 |
| 30506 | 0.02 | 96 | 212560 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 99.88 | 7 | A | Urban | 11700 | 0 |
| 32427 | 0.11 | 102 | 102580 | 0 | 0 | 0 | 0 | 1 | 4 | 1 | 99.27 | 9 | B | Urban | 7500 | 0 |
| 33180 | 0.00 | 99 | 195100 | 0 | 0 | 0 | 1 | 3 | 3 | 1 | 99.07 | 17 | A | Urban | 9600 | 1 |
| 33787 | 0.14 | 97 | 175110 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 99.44 | 14 | A | Urban | 13800 | 0 |
| 41278 | 0.01 | 97 | 240150 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 99.07 | 9 | A | Rural | 9600 | 0 |
| 42063 | 0.03 | 99 | 191210 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 98.11 | 14 | A | Rural | 7500 | 0 |
| 45077 | 0.01 | 97 | 200080 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 99.07 | 4 | A | Rural | 9600 | 0 |
| 46149 | 0.03 | 101 | 50050 | 0 | 0 | 0 | 1 | 3 | 3 | 1 | 99.87 | 7 | A | Rural | 5700 | 0 |
| 49070 | 0.00 | 101 | 86570 | 2 | 0 | 0 | 0 | 1 | 4 | 0 | 99.07 | 8 | A | Rural | 7500 | 1 |
| 50614 | 0.00 | 97 | 69510 | 0 | 0 | 0 | 0 | 2 | 4 | 0 | 99.07 | 7 | A | Rural | 3300 | 0 |
| 69077 | 0.00 | 96 | 42560 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 99.46 | 4 | A | Urban | 3300 | 0 |
| 74847 | 0.01 | 102 | 99060 | 0 | 0 | 0 | 1 | 2 | 2 | 1 | 99.89 | 5 | A | Urban | 3300 | 0 |
| 77800 | 0.02 | 99 | 285040 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 99.88 | 17 | A | Rural | 13800 | 0 |
data.drop(index=data[data.age>95].index,inplace=True)
sns.distplot(data["Income"])
<AxesSubplot:xlabel='Income', ylabel='Density'>
sns.distplot(np.log(data["Income"]), axlabel="Log(Income)")
<AxesSubplot:xlabel='Log(Income)', ylabel='Density'>
data["Income_log"] = np.log(data["Income"])
data.drop(columns=["Income"], inplace=True)
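The log transform compresses the long right tail of Income; a sketch with hypothetical incomes spanning the dataset's range:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, from near the dataset's minimum to its extreme maximum
income = pd.Series([24030, 166560, 252090, 90262600])
income_log = np.log(income)
# The raw values span a ~3756x ratio; the logged values span only ~8.2 units
print(round(income_log.max() - income_log.min(), 2))  # 8.23
```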
histogram_boxplot(data["Income_log"])
data[(data.Income_log>14)]
| perc_premium_paid_by_cash | age | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | Income_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | 1.00 | 45 | 0 | 0 | 0 | 1 | 3 | 3 | 1 | 99.58 | 8 | D | Urban | 60000 | 0 | 14.727813 |
| 445 | 0.01 | 70 | 0 | 0 | 0 | 1 | 3 | 2 | 1 | 99.70 | 12 | A | Urban | 60000 | 0 | 14.436152 |
| 623 | 0.00 | 66 | 1 | 0 | 0 | 1 | 2 | 3 | 0 | 99.07 | 15 | A | Urban | 60000 | 0 | 14.375172 |
| 801 | 0.47 | 44 | 1 | 0 | 0 | 0 | 3 | 4 | 0 | 99.89 | 11 | B | Rural | 60000 | 0 | 17.286703 |
| 991 | 0.66 | 57 | 0 | 0 | 0 | 0 | 3 | 4 | 1 | 99.85 | 19 | A | Rural | 60000 | 0 | 14.216017 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78903 | 0.02 | 75 | 2 | 0 | 0 | 0 | 2 | 1 | 0 | 99.86 | 12 | A | Urban | 60000 | 0 | 14.115556 |
| 78913 | 0.00 | 44 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 99.85 | 8 | D | Urban | 13800 | 0 | 14.316346 |
| 79194 | 0.94 | 54 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 99.76 | 11 | B | Urban | 60000 | 0 | 14.404674 |
| 79416 | 0.42 | 65 | 2 | 0 | 0 | 1 | 1 | 3 | 1 | 99.08 | 13 | B | Rural | 60000 | 0 | 14.508698 |
| 79579 | 0.40 | 58 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 99.50 | 16 | C | Urban | 60000 | 0 | 14.174988 |
339 rows × 16 columns
histogram_boxplot(data["Count_3-6_months_late"])
Right skewed.
histogram_boxplot(data["Count_6-12_months_late"])
Right skewed.
histogram_boxplot(data["Count_more_than_12_months_late"])
Right skewed.
histogram_boxplot(data["Veh_Owned"])
Average 2 vehicles owned.
histogram_boxplot(data["No_of_dep"])
Average 2.5 dependents. No outliers.
histogram_boxplot(data["risk_score"])
Left skewed. Average 99%. Outliers in the lower range.
data[(data.risk_score<92)]
| perc_premium_paid_by_cash | age | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | Income_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17379 | 1.00 | 41 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 91.96 | 6 | A | Urban | 18000 | 0 | 12.652392 |
| 22477 | 0.04 | 54 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 91.98 | 9 | A | Rural | 5700 | 0 | 10.877103 |
| 49332 | 0.06 | 44 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 91.90 | 11 | A | Urban | 5700 | 0 | 10.452476 |
| 63359 | 0.02 | 62 | 0 | 0 | 0 | 1 | 3 | 2 | 1 | 91.96 | 9 | A | Rural | 3300 | 0 | 10.130225 |
| 67197 | 0.21 | 56 | 1 | 0 | 0 | 0 | 2 | 2 | 0 | 91.96 | 26 | A | Rural | 7500 | 0 | 12.881946 |
histogram_boxplot(data["no_of_premiums_paid"])
Right skewed. Average 10.86. Outliers in upper range.
data[(data.no_of_premiums_paid>25)]
| perc_premium_paid_by_cash | age | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | Income_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | 0.19 | 56 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 98.31 | 32 | A | Rural | 7500 | 0 | 12.117733 |
| 467 | 0.23 | 73 | 7 | 0 | 3 | 0 | 3 | 1 | 1 | 98.08 | 26 | A | Rural | 20100 | 0 | 12.933670 |
| 635 | 0.60 | 64 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 99.01 | 26 | A | Urban | 20100 | 1 | 12.660804 |
| 673 | 0.60 | 50 | 1 | 0 | 0 | 1 | 3 | 2 | 0 | 99.01 | 36 | D | Urban | 49500 | 0 | 13.723647 |
| 680 | 0.21 | 61 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 99.28 | 26 | A | Rural | 9600 | 0 | 12.506733 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79402 | 0.46 | 49 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 99.13 | 28 | B | Urban | 51600 | 0 | 13.169457 |
| 79415 | 0.59 | 69 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 96.95 | 26 | B | Rural | 34800 | 0 | 13.029436 |
| 79507 | 0.31 | 40 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 98.97 | 27 | A | Rural | 9600 | 0 | 12.072884 |
| 79760 | 0.17 | 60 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 98.37 | 26 | A | Rural | 7500 | 0 | 11.918990 |
| 79763 | 0.04 | 83 | 1 | 0 | 0 | 1 | 2 | 2 | 1 | 99.55 | 32 | A | Rural | 11700 | 0 | 12.190400 |
1121 rows × 16 columns
histogram_boxplot(data["premium"])
Right skewed. Average 10924. Outliers in upper range.
data[(data.premium>25000)]
| perc_premium_paid_by_cash | age | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | sourcing_channel | residence_area_type | premium | default | Income_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 0.00 | 60 | 0 | 0 | 0 | 0 | 2 | 3 | 1 | 99.44 | 15 | A | Urban | 32700 | 0 | 13.097210 |
| 37 | 0.22 | 59 | 0 | 0 | 0 | 1 | 3 | 1 | 1 | 99.48 | 12 | B | Rural | 57900 | 0 | 13.543754 |
| 53 | 0.09 | 79 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 99.61 | 13 | A | Urban | 32700 | 0 | 12.948248 |
| 67 | 0.15 | 61 | 0 | 0 | 0 | 1 | 2 | 3 | 0 | 99.38 | 9 | C | Urban | 43200 | 0 | 12.915388 |
| 95 | 1.00 | 45 | 0 | 0 | 0 | 1 | 3 | 3 | 1 | 99.58 | 8 | D | Urban | 60000 | 0 | 14.727813 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79750 | 0.43 | 69 | 1 | 0 | 0 | 0 | 3 | 2 | 1 | 99.76 | 24 | A | Rural | 55800 | 0 | 13.682025 |
| 79766 | 0.17 | 44 | 0 | 0 | 0 | 1 | 1 | 4 | 1 | 96.56 | 22 | C | Rural | 39000 | 0 | 13.641336 |
| 79773 | 0.12 | 56 | 0 | 0 | 0 | 0 | 3 | 4 | 1 | 99.67 | 20 | D | Urban | 39000 | 0 | 13.017181 |
| 79797 | 0.00 | 61 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 98.46 | 11 | A | Rural | 26400 | 0 | 12.710206 |
| 79849 | 0.00 | 46 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 99.65 | 9 | B | Urban | 28500 | 0 | 13.400056 |
5578 rows × 16 columns
def perc_on_bar(plot, feature):
    """
    plot: the countplot axes
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    """
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["default"], palette="winter")
perc_on_bar(ax, data["default"])
Non-defaulted customers 93.7% Defaulted 6.3%.
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Accomodation"], palette="winter")
perc_on_bar(ax, data["Accomodation"])
Owned 50.1% Rented 49.9%
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Marital Status"], palette="winter")
perc_on_bar(ax, data["Marital Status"])
Married 49.9% Unmarried 50.1%
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Veh_Owned"], palette="winter")
perc_on_bar(ax, data["Veh_Owned"])
Vehicles owned range from 1 to 3, roughly equally distributed. Average is 2.
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["No_of_dep"], palette="winter")
perc_on_bar(ax, data["No_of_dep"])
Number of dependents 1-24.8%, 2-24.9%, 3-25.3%, 4-24.9%
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Count_3-6_months_late"], palette="winter")
perc_on_bar(ax, data["Count_3-6_months_late"])
0 times premium was paid 3-6 months late 83.8% 1 time 11.1% 2 times 3.2% 3 times 1.2% 4 times 0.5% 5 times 0.2% 6 times 0.1%
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Count_6-12_months_late"], palette="winter")
perc_on_bar(ax, data["Count_6-12_months_late"])
0 times premium was paid 6-12 months late 95.1% 1 time 3.4% 2 times 0.9% 3 times 0.4% 4 times 0.2% 5 times 0.1%
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Count_more_than_12_months_late"], palette="winter")
perc_on_bar(ax, data["Count_more_than_12_months_late"])
0 times premium was paid more than 12 months late 95.3% 1 time 3.8% 2 times 0.6% 3 times 0.2% 4 times 0.1%
sns.pairplot(data, hue="default")
<seaborn.axisgrid.PairGrid at 0x17994b07250>
There are overlaps. No clear distinction in the distribution of variables for people who have defaulted and did not default.
sns.lineplot(x='age',y='default',data=data)
<AxesSubplot:xlabel='age', ylabel='default'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="default", y="age", data=data, orient="vertical")
<AxesSubplot:xlabel='default', ylabel='age'>
Median age of defaulters is less than non-defaulters. Younger customers more likely to default. Outliers in both plots.
sns.lineplot(x='perc_premium_paid_by_cash',y='default',data=data)
<AxesSubplot:xlabel='perc_premium_paid_by_cash', ylabel='default'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="default", y="perc_premium_paid_by_cash", data=data, orient="vertical")
<AxesSubplot:xlabel='default', ylabel='perc_premium_paid_by_cash'>
Defaulters had a higher median percentage of premiums paid by cash. Cash-paying customers are more likely to default. Most non-defaulters paid little of their premium in cash.
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="default", y="risk_score", data=data, orient="vertical")
<AxesSubplot:xlabel='default', ylabel='risk_score'>
Median risk score is higher for non-defaulters. Less chance for default with higher risk scores.
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="default", y="no_of_premiums_paid", data=data, orient="vertical")
<AxesSubplot:xlabel='default', ylabel='no_of_premiums_paid'>
Outliers in upper range. Defaulters and non-defaulters very similar for number of premiums paid.
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="default", y="No_of_dep", data=data, orient="vertical")
<AxesSubplot:xlabel='default', ylabel='No_of_dep'>
Defaulters have a higher number of dependents.
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="default", y="Marital Status", data=data, orient="vertical")
<AxesSubplot:xlabel='default', ylabel='Marital Status'>
Fairly even.
def stacked_plot(x):
sns.set(palette="nipy_spectral")
tab1 = pd.crosstab(x, data["default"], margins=True)
print(tab1)
print("-" * 120)
tab = pd.crosstab(x, data["default"], normalize="index")
tab.plot(kind="bar", stacked=True, figsize=(20, 10))
plt.show()
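The stacked plot is driven by `pd.crosstab` with `normalize="index"`, which converts counts to per-row rates; a minimal sketch on made-up data:

```python
import pandas as pd

# Made-up late-payment counts vs default flag
df = pd.DataFrame({"late": [0, 0, 0, 1, 1, 2], "default": [0, 0, 1, 0, 1, 1]})
tab = pd.crosstab(df["late"], df["default"], normalize="index")
# Each row sums to 1; column 1 holds the default rate for that late-count
print(tab[1].round(3).to_dict())
```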
stacked_plot(data["Count_3-6_months_late"])
default 0 1 All Count_3-6_months_late 0 64194 2686 66880 1 7672 1153 8825 2 1927 591 2518 3 666 288 954 4 216 158 374 5 101 67 168 6 37 31 68 7 13 10 23 8 9 6 15 9 2 2 4 10 0 1 1 11 0 1 1 12 0 1 1 13 0 1 1 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
The higher the count, the greater the chance of default. Counts of 10–13 defaulted 100% of the time.
stacked_plot(data["Count_6-12_months_late"])
default 0 1 All Count_6-12_months_late 0 72406 3503 75909 1 1851 828 2679 2 359 334 693 3 132 185 317 4 45 85 130 5 16 30 46 6 13 13 26 7 4 7 11 8 2 3 5 9 2 2 4 10 3 1 4 11 1 1 2 12 0 1 1 13 1 1 2 14 1 1 2 15 1 0 1 17 0 1 1 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
The higher the count, the greater the chance of default. Counts of 12 and 17 defaulted 100% of the time.
stacked_plot(data["Count_more_than_12_months_late"])
default 0 1 All Count_more_than_12_months_late 0 72308 3808 76116 1 2160 835 2995 2 270 228 498 3 66 85 151 4 23 25 48 5 6 7 13 6 2 4 6 7 1 2 3 8 1 1 2 11 0 1 1 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
The higher the count, the greater the chance of default. A count of 11 defaulted 100% of the time.
stacked_plot(data["Marital Status"])
default 0 1 All Marital Status 0 37452 2570 40022 1 37385 2426 39811 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
No real differences indicated between married and non-married.
stacked_plot(data["Veh_Owned"])
default 0 1 All Veh_Owned 1 25072 1667 26739 2 24834 1678 26512 3 24931 1651 26582 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
No real difference indicated by number of vehicles owned.
stacked_plot(data["No_of_dep"])
default 0 1 All No_of_dep 1 18646 1190 19836 2 18640 1258 19898 3 18927 1282 20209 4 18624 1266 19890 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
No real difference indicated by number of dependents.
stacked_plot(data["Accomodation"])
default 0 1 All Accomodation 0 37359 2452 39811 1 37478 2544 40022 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
No real difference indicated by accommodation type.
stacked_plot(data["no_of_premiums_paid"])
default 0 1 All no_of_premiums_paid 2 511 215 726 3 1510 235 1745 4 2634 271 2905 5 3887 325 4212 6 5315 319 5634 7 6241 379 6620 8 6813 370 7183 9 6805 351 7156 10 6546 326 6872 11 6043 351 6394 12 5113 294 5407 13 4486 266 4752 14 3749 237 3986 15 3085 179 3264 16 2499 179 2678 17 1995 150 2145 18 1679 120 1799 19 1266 89 1355 20 1055 79 1134 21 787 51 838 22 663 50 713 23 470 33 503 24 357 29 386 25 282 23 305 26 226 15 241 27 178 8 186 28 140 12 152 29 111 8 119 30 85 6 91 31 56 5 61 32 45 6 51 33 41 2 43 34 37 1 38 35 29 2 31 36 23 0 23 37 14 0 14 38 5 3 8 39 5 0 5 40 6 0 6 41 5 1 6 42 7 0 7 43 2 1 3 44 4 0 4 45 2 1 3 47 4 1 5 48 3 0 3 49 1 0 1 50 1 2 3 51 3 0 3 52 2 0 2 53 2 0 2 54 2 0 2 55 1 0 1 56 3 0 3 58 2 0 2 59 0 1 1 60 1 0 1 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
Default is about 30% at 2 premiums paid, then falls to roughly 5–10% through 37. Higher counts (such as 38, 41, 43, 45, 47, 50, and 59) show elevated default rates, but sample sizes there are very small.
stacked_plot(data["premium"])
default 0 1 All premium 1200 6402 495 6897 3300 9137 716 9853 5400 10104 828 10932 5700 2810 258 3068 7500 9512 678 10190 9600 7912 504 8416 11700 6464 367 6831 13800 6195 351 6546 15900 2704 127 2831 18000 2915 147 3062 20100 2297 125 2422 22200 1759 80 1839 24300 1305 63 1368 26400 1002 53 1055 28500 772 29 801 30600 456 18 474 32700 653 35 688 34800 372 20 392 36900 347 12 359 39000 249 10 259 41100 195 12 207 43200 152 6 158 45300 130 11 141 47400 127 7 134 49500 92 8 100 51600 79 2 81 53700 66 5 71 55800 50 2 52 57900 180 7 187 60000 399 20 419 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
Default percentage stays steady at around 6-10% throughout the observations.
stacked_plot(data["perc_premium_paid_by_cash"])
default 0 1 All perc_premium_paid_by_cash 0.0 7901 219 8120 0.01 4702 62 4764 0.02 3890 59 3949 0.03 3316 51 3367 0.04 2660 47 2707 ... ... ... ... 0.97 333 87 420 0.98 334 88 422 0.99 346 97 443 1.0 4203 1030 5233 All 74837 4996 79833 [102 rows x 3 columns] ------------------------------------------------------------------------------------------------------------------------
The higher the percentage paid in cash, the higher the chance for default.
stacked_plot(data["residence_area_type"])
default 0 1 All residence_area_type Rural 29662 1997 31659 Urban 45175 2999 48174 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
No real differences indicated.
stacked_plot(data["sourcing_channel"])
default 0 1 All sourcing_channel A 40768 2347 43115 B 15445 1066 16511 C 11136 903 12039 D 6925 634 7559 E 563 46 609 All 74837 4996 79833 ------------------------------------------------------------------------------------------------------------------------
Sourcing channels C, D, and E show slightly higher chances of default.
def plot(x,target='default'):
fig,axs = plt.subplots(2,2,figsize=(12,10))
axs[0, 0].set_title('Defaulted')
sns.distplot(data[(data[target] == 1)][x],ax=axs[0,0],color='teal')
axs[0, 1].set_title('Not Defaulted')
sns.distplot(data[(data[target] == 0)][x],ax=axs[0,1],color='orange')
axs[1,0].set_title('Boxplot w.r.t defaulted')
sns.boxplot(data[target],data[x],ax=axs[1,0],palette='gist_rainbow')
axs[1,1].set_title('Boxplot w.r.t defaulted - Without outliers')
sns.boxplot(data[target],data[x],ax=axs[1,1],showfliers=False,palette='gist_rainbow')
plt.tight_layout()
plt.show()
plot('Income_log')
Higher income less likely to default.
sns.set(rc={"figure.figsize": (20, 20)})
sns.heatmap(
data.corr(),
annot=True,
linewidths=0.5,
center=0,
cbar=False,
cmap="YlGnBu",
fmt="0.2f",
)
plt.show()
Will apply Logistic Regression (with over-sampling, under-sampling, and regularization), Decision Tree, Random Forest, Bagging, AdaBoost, Gradient Boosting, and XGBoost. The best-performing model will be chosen based on recall, and the most important variables for predicting default will be identified.
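As a sketch of the over-sampling step mentioned above (using `sklearn.utils.resample` on a made-up frame, not the project data):

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced frame: 8 non-defaulters, 2 defaulters
df = pd.DataFrame({"x": range(10), "default": [0] * 8 + [1] * 2})
majority = df[df["default"] == 0]
minority = df[df["default"] == 1]
# Sample the minority class with replacement up to the majority size
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=7)
balanced = pd.concat([majority, upsampled])
print(balanced["default"].value_counts().to_dict())  # {0: 8, 1: 8}
```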
data1=data.copy()
X = data1.drop(['default'],axis=1)
y = data1['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7,stratify=y)
print(X_train.shape, X_test.shape)
(55883, 15) (23950, 15)
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
perc_premium_paid_by_cash 0 age 0 Count_3-6_months_late 0 Count_6-12_months_late 0 Count_more_than_12_months_late 0 Marital Status 0 Veh_Owned 0 No_of_dep 0 Accomodation 0 risk_score 0 no_of_premiums_paid 0 sourcing_channel 0 residence_area_type 0 premium 0 Income_log 0 dtype: int64 ------------------------------ perc_premium_paid_by_cash 0 age 0 Count_3-6_months_late 0 Count_6-12_months_late 0 Count_more_than_12_months_late 0 Marital Status 0 Veh_Owned 0 No_of_dep 0 Accomodation 0 risk_score 0 no_of_premiums_paid 0 sourcing_channel 0 residence_area_type 0 premium 0 Income_log 0 dtype: int64
No missing values.
X_train=pd.get_dummies(X_train,drop_first=True)
X_test=pd.get_dummies(X_test,drop_first=True)
print(X_train.shape, X_test.shape)
(55883, 18) (23950, 18)
Encoded categorical values.
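One caveat with calling `get_dummies` separately on train and test: if a category level is absent from the test split, the dummy columns will not line up. A sketch of guarding against that with `reindex` (hypothetical frames, not the project data):

```python
import pandas as pd

# Hypothetical frames where level "C" never appears in the test split
train = pd.DataFrame({"channel": ["A", "B", "C"], "x": [1, 2, 3]})
test = pd.DataFrame({"channel": ["A", "B"], "x": [4, 5]})

X_tr = pd.get_dummies(train, drop_first=True)
X_te = pd.get_dummies(test, drop_first=True)
# Align test to the train columns, filling the missing dummy with zeros
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
print(X_te.columns.tolist())  # ['x', 'channel_B', 'channel_C']
```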
X_train.head()
| perc_premium_paid_by_cash | age | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | Marital Status | Veh_Owned | No_of_dep | Accomodation | risk_score | no_of_premiums_paid | premium | Income_log | sourcing_channel_B | sourcing_channel_C | sourcing_channel_D | sourcing_channel_E | residence_area_type_Urban | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34181 | 0.88 | 28 | 0 | 0 | 2 | 1 | 2 | 1 | 1 | 99.30 | 14 | 9600 | 12.101101 | 0 | 0 | 0 | 0 | 1 |
| 19620 | 0.04 | 33 | 0 | 0 | 0 | 1 | 2 | 3 | 1 | 99.16 | 10 | 20100 | 12.252145 | 0 | 0 | 0 | 0 | 0 |
| 13636 | 0.84 | 36 | 0 | 0 | 0 | 1 | 2 | 3 | 0 | 99.21 | 7 | 5400 | 11.163368 | 0 | 0 | 0 | 0 | 1 |
| 175 | 0.00 | 50 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 98.73 | 15 | 3300 | 11.488940 | 0 | 0 | 0 | 0 | 0 |
| 58530 | 0.05 | 49 | 0 | 0 | 0 | 1 | 2 | 3 | 0 | 99.07 | 5 | 45300 | 13.190265 | 1 | 0 | 0 | 0 | 0 |
def get_metrics_score(model, train, test, train_y, test_y, flag=True):
    '''
    model : fitted classifier
    Returns [train_acc, test_acc, train_recall, test_recall,
             train_precision, test_precision]; prints them when flag is True.
    '''
    pred_train = model.predict(train)
    pred_test = model.predict(test)
    train_acc = model.score(train, train_y)
    test_acc = model.score(test, test_y)
    train_recall = metrics.recall_score(train_y, pred_train)
    test_recall = metrics.recall_score(test_y, pred_test)
    train_precision = metrics.precision_score(train_y, pred_train)
    test_precision = metrics.precision_score(test_y, pred_test)
    score_list = [train_acc, test_acc, train_recall, test_recall,
                  train_precision, test_precision]
    if flag:
        # Reuse the values computed above instead of re-scoring the model
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list
def make_confusion_matrix(model, y_actual):
    '''
    model    : fitted classifier; predictions are made on the global X_test
    y_actual : ground-truth labels for X_test
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
lr = LogisticRegression(random_state=1)
lr.fit(X_train,y_train)
LogisticRegression(random_state=1)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
cv_result_bfr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
plt.boxplot(cv_result_bfr)
plt.show()
Poor CV score.
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)
make_confusion_matrix(lr,y_test)
Accuracy on training set :  0.9378701930819748
Accuracy on test set :  0.9377870563674322
Recall on training set :  0.10122962539319416
Recall on test set :  0.0980653769179453
Precision on training set :  0.5183016105417276
Precision on test set :  0.5157894736842106
Recall is very poor: with only ~6% defaulters in the training data, the model labels almost every customer a non-defaulter.
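Before resampling, it is worth noting that recall can also be traded for precision simply by lowering the default 0.5 cutoff applied to predict_proba. A hedged sketch on synthetic imbalanced data (make_classification stands in for the real training split; the threshold values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic stand-in for the imbalanced premium data (~6% positives)
X_toy, y_toy = make_classification(n_samples=4000, weights=[0.94], random_state=1)
clf = LogisticRegression(random_state=1).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)[:, 1]       # estimated P(default)
for t in (0.5, 0.3, 0.1):
    pred = (proba >= t).astype(int)          # lower cutoff -> more predicted defaulters
    print(t, round(recall_score(y_toy, pred), 3))
```

Lowering the threshold can only add predicted positives, so recall never decreases as the cutoff drops; the cost is extra false positives, i.e. lower precision.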
from imblearn.over_sampling import SMOTE
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train==0)))
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1) #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over==1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_over.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 3497
Before UpSampling, counts of label 'No': 52386

After UpSampling, counts of label 'Yes': 52386
After UpSampling, counts of label 'No': 52386

After UpSampling, the shape of train_X: (104772, 18)
After UpSampling, the shape of train_y: (104772,)
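Mechanically, SMOTE builds each synthetic minority row by taking a minority sample, one of its k_neighbors nearest minority neighbours, and a random point on the segment between them. A dependency-free sketch of that interpolation step, on made-up feature values (the two points below are toy numbers, not rows from this dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two minority-class points, e.g. (perc_premium_paid_by_cash, age) -- toy values
x_i = np.array([0.80, 30.0])    # the sample being over-sampled
x_nn = np.array([0.90, 34.0])   # one of its nearest minority neighbours

gap = rng.uniform(0.0, 1.0)          # random position along the segment
x_new = x_i + gap * (x_nn - x_i)     # synthetic sample between the two parents
print(x_new)
```

Because the new point lies on the segment between two real minority samples, SMOTE densifies the minority region rather than duplicating rows the way naive over-sampling does.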
log_reg_over = LogisticRegression(random_state = 1)
log_reg_over.fit(X_train_over,y_train_over)
LogisticRegression(random_state=1)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
cv_result_over=cross_val_score(estimator=log_reg_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
plt.boxplot(cv_result_over)
plt.show()
get_metrics_score(log_reg_over,X_train_over,X_test,y_train_over,y_test)
make_confusion_matrix(log_reg_over,y_test)
Accuracy on training set :  0.7329630053831176
Accuracy on test set :  0.7433820459290188
Recall on training set :  0.7137594013667774
Recall on test set :  0.5703802535023349
Precision on training set :  0.742267836582364
Precision on test set :  0.13449740443605473
Much better recall (0.57 on test), though test precision collapses to 0.13: over-sampling trades many false positives for fewer missed defaulters.
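An alternative that avoids modifying the data at all is cost-sensitive learning: LogisticRegression accepts class_weight='balanced', which re-weights each class inversely to its frequency inside the loss. A hedged sketch on synthetic imbalanced data (not the project data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced stand-in (~6% positives)
X_cw, y_cw = make_classification(n_samples=4000, weights=[0.94], random_state=1)

plain = LogisticRegression(random_state=1).fit(X_cw, y_cw)
weighted = LogisticRegression(class_weight="balanced", random_state=1).fit(X_cw, y_cw)

# Re-weighting typically lifts minority recall much like over-sampling does
print("plain   :", round(recall_score(y_cw, plain.predict(X_cw)), 3))
print("weighted:", round(recall_score(y_cw, weighted.predict(X_cw)), 3))
```

Class weighting keeps the training set at its original size, which matters when, as with SMOTE here, resampling doubles the number of rows.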
lr_estimator = LogisticRegression(random_state=1,solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(lr_estimator, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
lr_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
lr_estimator.fit(X_train_over, y_train_over)
LogisticRegression(C=0.1, random_state=1, solver='saga')
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
cv_result_estimator=cross_val_score(estimator=lr_estimator, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
plt.boxplot(cv_result_estimator)
plt.show()
get_metrics_score(lr_estimator,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(lr_estimator,y_test)
Accuracy on training set :  0.588582827472989
Accuracy on test set :  0.6027557411273486
Recall on training set :  0.572423930057649
Recall on test set :  0.5557038025350234
Precision on training set :  0.5915412384352869
Precision on test set :  0.08604483007953724
Worse recall performance.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un==1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 3497
Before Under Sampling, counts of label 'No': 52386

After Under Sampling, counts of label 'Yes': 3497
After Under Sampling, counts of label 'No': 3497

After Under Sampling, the shape of train_X: (6994, 18)
After Under Sampling, the shape of train_y: (6994,)
log_reg_under = LogisticRegression(random_state = 1)
log_reg_under.fit(X_train_un,y_train_un )
LogisticRegression(random_state=1)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=log_reg_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_under)
plt.show()
get_metrics_score(log_reg_under,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_under,y_test)
Accuracy on training set :  0.7590792107520732
Accuracy on test set :  0.8016283924843424
Recall on training set :  0.7034601086645696
Recall on test set :  0.7058038692461641
Precision on training set :  0.7915057915057915
Precision on test set :  0.19709388971684053
Best test recall so far (0.71), with higher precision than the over-sampled model.
models = []
models.append(
(
"DTREE",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
models.append(
(
"Bagging",
Pipeline(
steps=[
("scaler", StandardScaler()),
("bagging", BaggingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"RF",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"ADB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"GBM",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
]
),
)
)
results = []
names = []
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
DTREE: 23.477825464949927
Bagging: 12.725485387287963
RF: 11.09531984467607
ADB: 16.299856938483547
GBM: 14.698344573881053
XGB: 14.755283057428981
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
DTREE has the highest CV recall (23.5%), followed by ADB (16.3%) and then XGB and GBM (~14.7%); Bagging (12.7%) and RF (11.1%) score lowest.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {'decisiontreeclassifier__max_depth': np.arange(2,30),
'decisiontreeclassifier__min_samples_leaf': [1, 2, 5, 7, 10],
'decisiontreeclassifier__max_leaf_nodes' : [2, 3, 5, 10,15],
'decisiontreeclassifier__min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'decisiontreeclassifier__max_depth': 2, 'decisiontreeclassifier__max_leaf_nodes': 5, 'decisiontreeclassifier__min_impurity_decrease': 0.0001, 'decisiontreeclassifier__min_samples_leaf': 1}
Score: 0.1338373186184345
Wall time: 6min 4s
# Creating new pipeline with best parameters
dtree_tuned1 = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(
max_depth=2,
max_leaf_nodes=5,
random_state=1,
min_impurity_decrease=0.0001,
min_samples_leaf=1
),
)
# Fit the model on training data
dtree_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(max_depth=2, max_leaf_nodes=5,
min_impurity_decrease=0.0001,
random_state=1))])
# Calculating different metrics
get_metrics_score(dtree_tuned1,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(dtree_tuned1, y_test)
Accuracy on training set : 0.9386754469158778 Accuracy on test set : 0.9380375782881002 Recall on training set : 0.13840434658278525 Recall on test set : 0.12074716477651767 Precision on training set : 0.5389755011135857 Precision on test set : 0.521613832853026
Poor recall scoring on both training and test.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
# Parameter grid to pass in RandomSearchCV
param_grid = {'decisiontreeclassifier__max_depth': np.arange(2,30),
'decisiontreeclassifier__min_samples_leaf': [1, 2, 5, 7, 10],
'decisiontreeclassifier__max_leaf_nodes' : [2, 3, 5, 10,15],
'decisiontreeclassifier__min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'decisiontreeclassifier__min_samples_leaf': 7, 'decisiontreeclassifier__min_impurity_decrease': 0.0001, 'decisiontreeclassifier__max_leaf_nodes': 10, 'decisiontreeclassifier__max_depth': 17} with CV score=0.1338373186184345:
Wall time: 6.06 s
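The speed-up over GridSearchCV (~6 min down to ~6 s) comes from evaluating only n_iter=50 sampled combinations instead of the exhaustive grid. Counting the grid defined above:

```python
import numpy as np

# Sizes of the four parameter lists in the decision-tree grid:
# max_depth (28 values) x min_samples_leaf (5) x max_leaf_nodes (5) x min_impurity_decrease (4)
n_combos = len(np.arange(2, 30)) * 5 * 5 * 4
print(n_combos)        # 2800 candidate combinations
print(n_combos * 5)    # 14000 model fits for GridSearchCV (cv=5) vs 50 * 5 = 250 here
```

That the randomized search still reached the same CV score (0.1338) suggests the recall surface is flat over much of this grid.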
# Creating new pipeline with best parameters
dtree_tuned2 = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(
max_depth=17,
max_leaf_nodes=10,
random_state=1,
min_impurity_decrease=0.0001,
min_samples_leaf=7
),
)
# Fit the model on training data
dtree_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(max_depth=17, max_leaf_nodes=10,
min_impurity_decrease=0.0001,
min_samples_leaf=7, random_state=1))])
# Calculating different metrics
get_metrics_score(dtree_tuned2,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(dtree_tuned2, y_test)
Accuracy on training set : 0.9386754469158778 Accuracy on test set : 0.9380375782881002 Recall on training set : 0.13840434658278525 Recall on test set : 0.12074716477651767 Precision on training set : 0.5389755011135857 Precision on test set : 0.521613832853026
Poor recall scoring on training and test.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), BaggingClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
'baggingclassifier__max_samples': [0.7,0.8,0.9,1],
'baggingclassifier__max_features': [0.7,0.8,0.9,1],
'baggingclassifier__n_estimators' : [10,20,30,40,50],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'baggingclassifier__max_features': 0.9, 'baggingclassifier__max_samples': 0.7, 'baggingclassifier__n_estimators': 50}
Score: 0.13068996525648888
Wall time: 3min 43s
# Creating new pipeline with best parameters
bagg_tuned1 = make_pipeline(
StandardScaler(),
BaggingClassifier(
max_features=0.9,
max_samples=0.7,
random_state=1,
n_estimators=50,
),
)
# Fit the model on training data
bagg_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('baggingclassifier',
BaggingClassifier(max_features=0.9, max_samples=0.7,
n_estimators=50, random_state=1))])
# Calculating different metrics
get_metrics_score(bagg_tuned1,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(bagg_tuned1, y_test)
Accuracy on training set :  0.9893706493924808
Accuracy on test set :  0.9376200417536534
Recall on training set :  0.8301401201029454
Recall on test set :  0.1000667111407605
Precision on training set :  1.0
Precision on test set :  0.5084745762711864
Very high training recall (0.83, with perfect training precision) but poor test recall (0.10): the bagged model is badly overfit.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), BaggingClassifier(random_state=1))
# Parameter grid to pass in RandomSearchCV
param_grid = {
'baggingclassifier__max_samples': [0.7,0.8,0.9,1],
'baggingclassifier__max_features': [0.7,0.8,0.9,1],
'baggingclassifier__n_estimators' : [10,20,30,40,50],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'baggingclassifier__n_estimators': 50, 'baggingclassifier__max_samples': 0.7, 'baggingclassifier__max_features': 0.9} with CV score=0.13068996525648888:
Wall time: 2min 15s
# Creating new pipeline with best parameters
bagg_tuned2 = make_pipeline(
StandardScaler(),
BaggingClassifier(
max_features=0.9,
random_state=1,
max_samples=0.7,
n_estimators=50
),
)
# Fit the model on training data
bagg_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('baggingclassifier',
BaggingClassifier(max_features=0.9, max_samples=0.7,
n_estimators=50, random_state=1))])
# Calculating different metrics
get_metrics_score(bagg_tuned2,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(bagg_tuned2, y_test)
Accuracy on training set :  0.9893706493924808
Accuracy on test set :  0.9376200417536534
Recall on training set :  0.8301401201029454
Recall on test set :  0.1000667111407605
Precision on training set :  1.0
Precision on test set :  0.5084745762711864
Identical parameters and metrics to the grid-searched bagging model; test recall remains poor.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"randomforestclassifier__n_estimators": [100,150,250],
"randomforestclassifier__min_samples_leaf": np.arange(1, 6),
"randomforestclassifier__max_features": [np.arange(0.3, 0.6, 0.1),'sqrt','log2'],
"randomforestclassifier__max_samples": np.arange(0.2, 0.6, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__max_samples': 0.5000000000000001, 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__n_estimators': 100}
Score: 0.10895605967708973
Wall time: 10min 2s
# Creating new pipeline with best parameters
rf_tuned1 = make_pipeline(
StandardScaler(),
RandomForestClassifier(
n_estimators=100,
max_features='sqrt',
random_state=1,
max_samples=0.5000000000000001,
min_samples_leaf=1
),
)
# Fit the model on training data
rf_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('randomforestclassifier',
RandomForestClassifier(max_features='sqrt',
max_samples=0.5000000000000001,
random_state=1))])
# Calculating different metrics
get_metrics_score(rf_tuned1,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(rf_tuned1, y_test)
Accuracy on training set :  0.9693287761931177
Accuracy on test set :  0.9384968684759917
Recall on training set :  0.5098655990849299
Recall on test set :  0.08939292861907938
Precision on training set :  1.0
Precision on test set :  0.5537190082644629
Poor test recall (0.089) despite much higher training recall (0.51): the forest overfits.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))
# Parameter grid to pass in RandomSearchCV
param_grid = {
"randomforestclassifier__n_estimators": [100,150,250],
"randomforestclassifier__min_samples_leaf": np.arange(1, 6),
"randomforestclassifier__max_features": [np.arange(0.3, 0.6, 0.1),'sqrt','log2'],
"randomforestclassifier__max_samples": np.arange(0.2, 0.6, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'randomforestclassifier__n_estimators': 100, 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__max_samples': 0.5000000000000001, 'randomforestclassifier__max_features': 'sqrt'} with CV score=0.10695319844676068:
Wall time: 2min 25s
# Creating new pipeline with best parameters
rf_tuned2 = make_pipeline(
StandardScaler(),
RandomForestClassifier(
n_estimators=100,
max_features='sqrt',
random_state=1,
max_samples=0.5000000000000001,
min_samples_leaf=2
),
)
# Fit the model on training data
rf_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('randomforestclassifier',
RandomForestClassifier(max_features='sqrt',
max_samples=0.5000000000000001,
min_samples_leaf=2, random_state=1))])
# Calculating different metrics
get_metrics_score(rf_tuned2,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(rf_tuned2, y_test)
Accuracy on training set :  0.9558363008428323
Accuracy on test set :  0.9392066805845511
Recall on training set :  0.29911352587932516
Recall on test set :  0.0913942628418946
Precision on training set :  0.9840075258701787
Precision on test set :  0.5930735930735931
Test recall is barely changed and still poor; training recall drops with min_samples_leaf=2, narrowing the overfit gap without helping the test set.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__n_estimators': 100}
Score: 0.17186920089924382
Wall time: 9min 40s
# Creating new pipeline with best parameters
abc_tuned1 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
n_estimators=100,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned1,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(abc_tuned1, y_test)
Accuracy on training set :  0.9422006692554086
Accuracy on test set :  0.9345720250521921
Recall on training set :  0.22390620531884473
Recall on test set :  0.15410273515677117
Precision on training set :  0.6027713625866051
Precision on test set :  0.4358490566037736
Poor recall performance.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in RandomSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'adaboostclassifier__n_estimators': 70, 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.1675818516247701:
Wall time: 3min 19s
# Creating new pipeline with best parameters
abc_tuned2 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
n_estimators=70,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1, n_estimators=70,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned2,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(abc_tuned2, y_test)
Accuracy on training set :  0.9410017357693753
Accuracy on test set :  0.9349478079331942
Recall on training set :  0.2010294538175579
Recall on test set :  0.15276851234156105
Precision on training set :  0.5829187396351575
Precision on test set :  0.44294003868471954
No change in poor recall performance.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"gradientboostingclassifier__n_estimators": np.arange(75,150,25),
"gradientboostingclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"gradientboostingclassifier__subsample":[0.5,0.7,1],
"gradientboostingclassifier__max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'gradientboostingclassifier__init': DecisionTreeClassifier(random_state=1), 'gradientboostingclassifier__learning_rate': 0.1, 'gradientboostingclassifier__max_features': 0.5, 'gradientboostingclassifier__n_estimators': 75, 'gradientboostingclassifier__subsample': 0.5}
Score: 0.23706519517678318
Wall time: 19min 33s
# Creating new pipeline with best parameters
gbc_tuned1 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1),  # NB: the grid search above selected DecisionTreeClassifier as init
learning_rate=0.1,
max_features=0.5,
n_estimators=75,
subsample=0.5,
random_state=1,
),
)
# Fit the model on training data
gbc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, n_estimators=75,
random_state=1, subsample=0.5))])
# Calculating different metrics
get_metrics_score(gbc_tuned1,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(gbc_tuned1, y_test)
Accuracy on training set :  0.9414848880697171
Accuracy on test set :  0.9392066805845511
Recall on training set :  0.15727766657134687
Recall on test set :  0.12741827885256837
Precision on training set :  0.6300114547537228
Precision on test set :  0.5634218289085545
Poor recall performance on test and training sets.
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(),GradientBoostingClassifier(random_state=1))
#Parameter grid to pass in RandomSearchCV
param_grid = {
"gradientboostingclassifier__init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"gradientboostingclassifier__n_estimators": np.arange(75,150,25),
"gradientboostingclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"gradientboostingclassifier__subsample":[0.5,0.7,1],
"gradientboostingclassifier__max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'gradientboostingclassifier__subsample': 1, 'gradientboostingclassifier__n_estimators': 100, 'gradientboostingclassifier__max_features': 0.5, 'gradientboostingclassifier__learning_rate': 0.05, 'gradientboostingclassifier__init': DecisionTreeClassifier(random_state=1)} with CV score=0.23706519517678318:
Wall time: 3min 36s
# Creating new pipeline with best parameters
gbc_tuned2 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1),  # NB: the randomized search above selected DecisionTreeClassifier as init
learning_rate=0.05,
max_features=0.5,
n_estimators=100,
subsample=1,
random_state=1,
),
)
# Fit the model on training data
gbc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
learning_rate=0.05,
max_features=0.5, random_state=1,
subsample=1))])
# Calculating different metrics
get_metrics_score(gbc_tuned2,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(gbc_tuned2, y_test)
Accuracy on training set :  0.9409122631211638
Accuracy on test set :  0.9391231732776618
Recall on training set :  0.13497283385759223
Recall on test set :  0.10206804536357572
Precision on training set :  0.630173564753004
Precision on test set :  0.5773584905660377
Still poor performance on recall test and training sets.
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[0,1,2,5,10],
'xgbclassifier__learning_rate':[0.01,0.1,0.2,0.05], 'xgbclassifier__gamma':[0,1,3,5],
'xgbclassifier__subsample':[0.7,0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Best parameters are {'xgbclassifier__gamma': 5, 'xgbclassifier__learning_rate': 0.05, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__subsample': 0.9} with CV score=0.6111011649294911:
Wall time: 4h 57min 43s
# Creating new pipeline with best parameters
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.9,
learning_rate=0.05,
gamma=5,
eval_metric='logloss',
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('xgbclassifier',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.05,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned1,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned1, y_test)
Accuracy on training set : 0.8637152622443319
Accuracy on test set : 0.8576617954070981
Recall on training set : 0.6700028595939377
Recall on test set : 0.6010673782521682
Precision on training set : 0.26609880749574105
Precision on test set : 0.2427262931034483
Recall improves substantially on both the training and test sets, though precision drops.
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(),XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in RandomizedSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[0,1,2,5,10],
'xgbclassifier__learning_rate':[0.01,0.1,0.2,0.05], 'xgbclassifier__gamma':[0,1,3,5],
'xgbclassifier__subsample':[0.7,0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'xgbclassifier__subsample': 0.9, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__n_estimators': 200, 'xgbclassifier__learning_rate': 0.01, 'xgbclassifier__gamma': 1} with CV score=0.6076680972818312:
Wall time: 8min 50s
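The runtime gap between the two searches (almost 5 hours vs. under 9 minutes) follows directly from the number of model fits: GridSearchCV evaluates every combination in the grid, while RandomizedSearchCV samples only `n_iter` of them. A quick check of the fit counts for the grid above:

```python
import numpy as np

# Number of candidate values per hyperparameter in the grid above
n_candidates = [
    len(np.arange(50, 300, 50)),  # n_estimators: 5 values
    5,                            # scale_pos_weight
    4,                            # learning_rate
    4,                            # gamma
    4,                            # subsample
]
cv = 5
grid_fits = int(np.prod(n_candidates)) * cv  # every combination x 5 folds
random_fits = 50 * cv                        # n_iter=50 x 5 folds
print(grid_fits, random_fits)  # 8000 250
```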
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=200,
scale_pos_weight=10,
learning_rate=0.01,
gamma=1,
subsample=0.9,
eval_metric='logloss',
),
),
]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('XGB',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, eval_metric='logloss',
gamma=1, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200,
n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned2,X_train,X_test,y_train,y_test)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned2, y_test)
Accuracy on training set : 0.8634826333589821
Accuracy on test set : 0.8569102296450939
Recall on training set : 0.6631398341435516
Recall on test set : 0.5997331554369579
Precision on training set : 0.26442417331813
Precision on test set : 0.2412775093934514
Recall is on par with the GridSearchCV-tuned model, at a small fraction of the search time.
# defining list of models
models = [lr]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train,X_test,y_train,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
# defining list of models
models = [log_reg_over, lr_estimator]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train_over,X_test,y_train_over,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
# defining list of models
models = [log_reg_under]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train_un,X_test,y_train_un,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
# defining list of models
models = [dtree_tuned1,dtree_tuned2,bagg_tuned1,bagg_tuned2,rf_tuned1,rf_tuned2,abc_tuned1,abc_tuned2,gbc_tuned1,gbc_tuned2,xgb_tuned1, xgb_tuned2]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train,X_test,y_train,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
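For reference, `get_metrics_score` is defined earlier in the notebook; a minimal sketch consistent with how it is called here (a final flag to suppress printing, returning the six scores in the order consumed above) could look like:

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score

def get_metrics_score(model, X_train, X_test, y_train, y_test, flag=True):
    """Return [train_acc, test_acc, train_rec, test_rec, train_prec, test_prec].

    Sketch only: the actual helper is defined earlier in the notebook.
    """
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    scores = [
        accuracy_score(y_train, pred_train),
        accuracy_score(y_test, pred_test),
        recall_score(y_train, pred_train),
        recall_score(y_test, pred_test),
        precision_score(y_train, pred_train),
        precision_score(y_test, pred_test),
    ]
    if flag:  # print only when not collecting scores silently
        names = ["Accuracy on training set", "Accuracy on test set",
                 "Recall on training set", "Recall on test set",
                 "Precision on training set", "Precision on test set"]
        for name, s in zip(names, scores):
            print(f"{name} : {s}")
    return scores
```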
comparison_frame = pd.DataFrame(
{
"Model": [
"Logistic Regression",
'Logistic Regression on Oversampled data',
'Logistic Regression-Regularized (Oversampled data)',
'Logistic Regression on Undersampled data',
"Decision Tree with GridSearchCV",
"Decision Tree with RandomizedSearchCV",
"Bagging Classifier with GridSearchCV",
"Bagging Classifier with RandomizedSearchCV",
"Random Forest with GridSearchCV",
"Random Forest with RandomizedSearchCV",
"AdaBoost with GridSearchCV",
"AdaBoost Tree with RandomizedSearchCV",
"GradientBoost with GridSearchCV",
"GradientBoost Tree with RandomizedSearchCV",
"XGBoost with GridSearchCV",
"XGBoost with RandomizedSearchCV",
],
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
}
)
# Displaying the model comparison table
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.937870 | 0.937787 | 0.101230 | 0.098065 | 0.518302 | 0.515789 |
| 1 | Logistic Regression on Oversampled data | 0.732963 | 0.743382 | 0.713759 | 0.570380 | 0.742268 | 0.134497 |
| 2 | Logistic Regression-Regularized (Oversampled d... | 0.588583 | 0.602756 | 0.572424 | 0.555704 | 0.591541 | 0.086045 |
| 3 | Logistic Regression on Undersampled data | 0.759079 | 0.801628 | 0.703460 | 0.705804 | 0.791506 | 0.197094 |
| 4 | Decision Tree with GridSearchCV | 0.938675 | 0.938038 | 0.138404 | 0.120747 | 0.538976 | 0.521614 |
| 5 | Decision Tree with RandomizedSearchCV | 0.938675 | 0.938038 | 0.138404 | 0.120747 | 0.538976 | 0.521614 |
| 6 | Bagging Classifier with GridSearchCV | 0.989371 | 0.937620 | 0.830140 | 0.100067 | 1.000000 | 0.508475 |
| 7 | Bagging Classifier with RandomizedSearchCV | 0.989371 | 0.937620 | 0.830140 | 0.100067 | 1.000000 | 0.508475 |
| 8 | Random Forest with GridSearchCV | 0.969329 | 0.938497 | 0.509866 | 0.089393 | 1.000000 | 0.553719 |
| 9 | Random Forest with RandomizedSearchCV | 0.955836 | 0.939207 | 0.299114 | 0.091394 | 0.984008 | 0.593074 |
| 10 | AdaBoost with GridSearchCV | 0.942201 | 0.934572 | 0.223906 | 0.154103 | 0.602771 | 0.435849 |
| 11 | AdaBoost Tree with RandomizedSearchCV | 0.941002 | 0.934948 | 0.201029 | 0.152769 | 0.582919 | 0.442940 |
| 12 | GradientBoost with GridSearchCV | 0.941485 | 0.939207 | 0.157278 | 0.127418 | 0.630011 | 0.563422 |
| 13 | GradientBoost Tree with RandomizedSearchCV | 0.940912 | 0.939123 | 0.134973 | 0.102068 | 0.630174 | 0.577358 |
| 14 | XGBoost with GridSearchCV | 0.863715 | 0.857662 | 0.670003 | 0.601067 | 0.266099 | 0.242726 |
| 15 | XGBoost with RandomizedSearchCV | 0.863483 | 0.856910 | 0.663140 | 0.599733 | 0.264424 | 0.241278 |
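The frame above is shown in insertion order; to rank the models by test recall, sort before displaying. A sketch on a hypothetical three-row miniature of the frame:

```python
import pandas as pd

# Miniature stand-in for comparison_frame (values copied from the table above)
df = pd.DataFrame({
    "Model": [
        "Logistic Regression",
        "XGBoost with GridSearchCV",
        "Logistic Regression on Undersampled data",
    ],
    "Test_Recall": [0.098065, 0.601067, 0.705804],
})
ranked = df.sort_values(by="Test_Recall", ascending=False).reset_index(drop=True)
print(ranked.loc[0, "Model"])  # Logistic Regression on Undersampled data
```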
feature_names = X_train.columns
importances = xgb_tuned1[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
feature_names = X_train.columns
importances = xgb_tuned2[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
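Indexing the fitted pipeline with `[1]`, as above, reaches the XGBoost step positionally; `named_steps` does the same by name and is more robust if steps are later added. A small demonstration on a toy pipeline (using a decision tree so the example stays self-contained):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name
clf = pipe.named_steps["decisiontreeclassifier"]
print(clf is pipe[1])  # True: both reach the same fitted estimator
print(clf.feature_importances_)
```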
importance = log_reg_under.coef_[0]
feat_importances = pd.Series(importance, index = feature_names)
feat_importances.nlargest(4).plot(kind='barh',title = 'Feature Importance')
<AxesSubplot:title={'center':'Feature Importance'}>
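One caveat with ranking raw logistic-regression coefficients: they carry sign, so `nlargest(4)` surfaces only the strongest *positive* drivers of default and can miss features with large negative coefficients. Ranking by absolute value catches both directions; a sketch with hypothetical coefficient values (not the fitted model's):

```python
import pandas as pd

# Hypothetical standardized-feature coefficients; sign gives direction of effect
coefs = pd.Series({
    "Count_3-6_months_late": 0.8,
    "perc_premium_paid_by_cash": 0.5,
    "risk_score": -0.9,   # strong driver, but negative
    "Income": -0.1,
})
top_signed = coefs.nlargest(2)     # misses risk_score entirely
top_abs = coefs.abs().nlargest(2)  # ranks risk_score first
print(top_abs.index.tolist())  # ['risk_score', 'Count_3-6_months_late']
```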
Predictive model built to: